Abstract

We describe two approaches to analyzing and
tagging team discourse using Latent Semantic
Analysis (LSA) to predict team performance.
The first approach automatically categorizes
the contents of each statement made by each
of the three team members using an established
set of tags. Automatic tag prediction performed
15% below human agreement. These tagged
statements are then used to predict team
performance. The second approach measures the
semantic content of the dialogue of the team as
a whole and accurately predicts the team’s
performance on a simulated military mission.


1  Introduction 

The growing complexity of tasks frequently surpasses
the cognitive capabilities of individuals and thus often
necessitates a team approach. Teams play an increasingly
critical role in complex military operations in
which technological and information demands require a
multi-operator environment. The ability to automatically
predict team performance would be of great value for
team training systems.

Verbal communication data from teams provides a
rich indication of cognitive processing at both the individual
and the team level and can be tied back to both
the team’s and each individual team member’s abilities
and knowledge. Manual analysis of team communication
shows promising results (see, for example, Bowers et al.,
1998). Nevertheless, the analysis is quite costly. Hand
coding for content is time consuming and can be highly
subjective. Thus, what is required are techniques for
automatically analyzing team communications in order to
categorize and predict performance.


In the research described in this paper, we apply Latent
Semantic Analysis (LSA) to measure free-form
verbal interactions among team members. Because it
can measure and compare the semantic information in
these verbal interactions, it can be used to characterize
the quality and quantity of information expressed. This
can be used to determine the semantic content of any
utterance made by a team member as well as to measure
the semantic similarity of an entire team’s communication
to that of another team. In this paper we describe research
on developing automated techniques for analyzing the
communication and predicting team performance using
a corpus of communication of teams performing simulated
military missions. We focus on two applications of
this approach. The first application is to automatically
predict the categories of discourse for each utterance
made by team members during a mission. These tagged
statements can then be used to predict overall team
performance. The second application is to automatically
predict the effectiveness of a team based on an analysis
of the entire discourse of the team during a mission. We
then conclude with a discussion of how these techniques
can be applied for automatic communications analysis
and integrated into training.

2  Data 

Our corpus (UAV-Corpus) consists of 67 transcripts
collected from 11 teams, who each completed 7 missions
that simulate flight of an Uninhabited Air Vehicle
(UAV) in the CERTT (Cognitive Engineering Research
on Team Tasks) Lab's synthetic team task environment
(CERTT UAV-STE). CERTT's UAV-STE is a
three-team-member task in which each team member is
provided with distinct, though overlapping, training; has
unique, yet interdependent roles; and is presented with
different and overlapping information during the mission.
The overall goal is to fly the UAV to designated
target areas and to take acceptable photos at these areas.

The 67 team-at-mission transcripts in the UAV-Corpus
contain approximately 2700 minutes of spoken
dialogue, in 20545 separate utterances or turns. There
are approximately 232,000 words or 660 KB of text. All
communication was manually transcribed.

We were provided with the results of manual annotation
of the corpus by three annotators using the Bowers
Tag Set (Bowers et al., 1998), which includes tags
for: acknowledgement, action, factual, planning, response,
uncertainty, and non-task related utterances.
The three annotators had each tagged 26 or 27
team-at-missions so that 12 team-at-missions were tagged by
two annotators. Inter-coder reliability had been computed
using the C-value measure (Schvaneveldt, 1990).
The overall C-value for transcripts with two taggers was
0.70. We computed Cohen’s Kappa to be 0.62 (see Section
4 and Table 1).

In addition to the moderate level of inter-coder agreement,
tagging was done at the turn level, where a turn
could range from a single word to several utterances by
a single speaker, and the number of tags that taggers
assigned to a given turn might not agree. We hope to
address these limitations in the data set with a more
thorough annotation study in the near future.

3  Latent  Semantic  Analysis 

LSA is a fully automatic corpus-based statistical method
for extracting and inferring relations of expected contextual
usage of words in discourse (Landauer et al.,
1998).

LSA has been used for a wide range of applications
and for simulating knowledge representation, discourse
and psycholinguistic phenomena. These applications
have included information retrieval (Deerwester et al.,
1990) and automated text analysis (Foltz, 1996). In
addition, LSA has been applied to a number of NLP
tasks, such as text segmentation (Choi et al., 2001).
More recently, Serafin et al. (2003) used LSA for dialogue
act classification, finding that LSA can effectively
be used for such classification and that adding features
to LSA showed promise.

To train LSA we added 2257 documents to the corpus
of UAV transcripts. These documents consisted of
training documents and pre- and post-training interviews
related to UAVs, resulting in a total of 22802
documents in the final corpus. For the UAV-Corpus we
used a 300-dimensional semantic space.

4  Automatic  Discourse  Tagging 

Our goal was to use semantic content of team dialogues
to better understand and predict team performance. The
approach we focus on here is to study the dialogue on
the turn level. Working within the limitations of the
manual annotations, we developed an algorithm to tag
transcripts automatically, resulting in some decrease in
performance, but a significant savings in time and resources.

We established a lower bound on tagging performance
of 0.27 by computing the tag frequency in the 12 transcripts
tagged by two taggers. If all utterances were
tagged with the most frequent tag, the percentage of
turns tagged correctly would be 27%.
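
The baseline computation can be illustrated with a short sketch. It is illustrative only: the function name `majority_baseline` and the toy tag list are assumptions rather than part of the study, and each turn is assumed to carry a single tag.

```python
from collections import Counter


def majority_baseline(gold_tags):
    """Return the most frequent tag and the accuracy obtained by
    assigning that tag to every turn."""
    if not gold_tags:
        raise ValueError("gold_tags must not be empty")
    tag, count = Counter(gold_tags).most_common(1)[0]
    return tag, count / len(gold_tags)


# Toy data invented for illustration (not taken from the UAV-Corpus).
toy_tags = ["F", "A", "F", "R", "U", "F", "A", "P"]
baseline_tag, baseline_accuracy = majority_baseline(toy_tags)
```

Any tagger that does not beat this accuracy has learned nothing beyond the tag distribution.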

Automatic Annotation with LSA. In order to test our
algorithm to automatically annotate the data, we computed
a "corrected tag" for all 2916 turns in the 12
team-at-mission transcripts tagged by two taggers. This was
necessary due to the only moderate agreement between
the taggers. We used the union of the sets of tags assigned
by the taggers as the "corrected tag".

The union, rather than the intersection, was used
since taggers sometimes missed relevant tags within a
turn. The union of tags assigned by multiple taggers
better captures all likely tag types within the turn. A
disadvantage to using "corrected tags" is the loss of
sequential tag information within individual turns.
However, the focus of this study was on identifying the
existence of relevant discourse, not on its order within
the turn.
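
As a minimal sketch of the "corrected tag" construction (the function name `corrected_tag` and the example tags are hypothetical), the union can be computed over unordered sets, which also makes the loss of within-turn order explicit:

```python
def corrected_tag(tags_a, tags_b):
    """Union of the tag sets two annotators assigned to the same turn.

    The result is an unordered set, so the sequence of tags within
    the turn is deliberately discarded.
    """
    return frozenset(tags_a) | frozenset(tags_b)


# One annotator saw a fact and uncertainty, the other only a fact.
example = corrected_tag(["F", "U"], ["F"])
```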

Then, for each of the 12 team-at-mission transcripts,
we automatically assigned "most probable" tags to each
turn, based on the corrected tags of the "most similar"
turns in the other 11 team-at-missions. For a given turn,
T, the algorithm proceeds as follows:

Find the turns in the other 11 team-at-mission transcripts
whose vectors in the semantic space have the
largest cosines when compared with T's vector in the
semantic space. We choose either the ones with the top
n (usually top 10) cosines, or the ones whose cosines are
above a certain threshold (usually 0.6). The corrected
tags for these "most similar" turns are retrieved. The
sum of the cosines for each tag that appears is computed
and normalized to give a probability that the tag is the
corrected tag. Finally, we determine the predicted tag by
applying a cutoff (0.3 and 0.4 seem to produce the best
results): all of the tags above the cutoff are chosen as
the predicted tag. If no tag has a probability above the
cutoff, then the single tag with the maximum probability
is chosen as the predicted tag.
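
The steps above can be sketched in code. This is a sketch under stated assumptions, not the implementation used in the study: the names `cosine` and `predict_tags` are hypothetical, turn vectors are plain lists of floats, and the normalization (dividing each tag's cosine sum by the total over all tags) is one plausible reading of "normalized", since the exact formula is not given.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_tags(turn_vec, labeled_turns, top_n=10, min_cosine=None, cutoff=0.3):
    """Predict tags for one turn.

    labeled_turns: list of (vector, corrected_tag_set) pairs taken from
    the other transcripts. If min_cosine is given, neighbors are the
    turns whose cosine exceeds it; otherwise the top_n turns are used.
    Returns (predicted_tag_set, tag_probabilities).
    """
    scored = [(cosine(turn_vec, vec), tags) for vec, tags in labeled_turns]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if min_cosine is not None:
        neighbors = [pair for pair in scored if pair[0] > min_cosine]
    else:
        neighbors = scored[:top_n]

    # Sum the cosines of the neighbors carrying each tag.
    totals = defaultdict(float)
    for cos, tags in neighbors:
        for tag in tags:
            totals[tag] += cos

    # Assumed normalization: divide by the total over all tags.
    norm = sum(totals.values())
    if norm <= 0:
        return set(), {}
    probs = {tag: total / norm for tag, total in totals.items()}

    # Keep every tag above the cutoff; otherwise fall back to the best one.
    chosen = {tag for tag, p in probs.items() if p > cutoff}
    if not chosen:
        chosen = {max(probs, key=probs.get)}
    return chosen, probs


# Toy 2-dimensional vectors invented for illustration.
labeled = [([1.0, 0.0], {"F"}), ([0.9, 0.1], {"F", "A"}), ([0.0, 1.0], {"U"})]
tags_low_cutoff, probs = predict_tags([1.0, 0.0], labeled, top_n=2, cutoff=0.3)
tags_high_cutoff, _ = predict_tags([1.0, 0.0], labeled, top_n=2, cutoff=0.4)
```

In the toy data the two nearest turns both carry "F" and one also carries "A", so a lower cutoff admits both tags while a higher one keeps only "F".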

We also computed the average cosine similarity of T
to its 10 closest turns as a measure of certainty of categorization.
For example, if T is not similar to any previously
categorized turns, then it would have a low
certainty. This permits the flagging of turns that the
algorithm is not likely to tag reliably.
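
A minimal sketch of this certainty measure follows. The function names and the `min_certainty` review threshold are hypothetical; the text does not state what value, if any, was used to flag turns.

```python
def categorization_certainty(cosines, k=10):
    """Average of the k largest cosines between a turn and the
    previously categorized turns (0.0 if there are none)."""
    top = sorted(cosines, reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0


def needs_review(cosines, k=10, min_certainty=0.5):
    """Flag a turn whose certainty falls below an assumed threshold."""
    return categorization_certainty(cosines, k) < min_certainty


# Toy cosines invented for illustration.
certainty = categorization_certainty([0.9, 0.1, 0.5], k=2)
```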

In order to improve our results, we considered ways
to incorporate simple discourse elements into our predictions.
We added two discourse features to our algorithm:
for any turn with a question mark, "?", we
increased the probability that uncertainty, "U", would be
one of the tags in its predicted tag; and for any turn
following a turn with a question mark, "?", we increased the
probability that response, "R", would be one of the tags
in its predicted tag.
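
The two discourse features might be applied to the tag probabilities as in the sketch below. The size of the increase is not reported, so the additive `boost` parameter, its default value, and the renormalization step are all assumptions made for illustration; the function name is likewise hypothetical.

```python
def apply_discourse_features(probs, turn_text, previous_turn_text=None, boost=0.2):
    """Raise P(U) for a turn containing '?', and P(R) for a turn that
    follows a turn containing '?', then renormalize to sum to 1."""
    adjusted = dict(probs)
    if "?" in turn_text:
        adjusted["U"] = adjusted.get("U", 0.0) + boost
    if previous_turn_text is not None and "?" in previous_turn_text:
        adjusted["R"] = adjusted.get("R", 0.0) + boost
    total = sum(adjusted.values())
    if total <= 0:
        return adjusted
    return {tag: p / total for tag, p in adjusted.items()}


# Toy probabilities and turn text invented for illustration.
question = apply_discourse_features({"F": 0.8, "U": 0.2}, "what is our altitude?")
answer = apply_discourse_features({"F": 1.0}, "two thousand feet", "what is our altitude?")
```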

We refer to our original algorithm as “LSA” and our
algorithm with the two discourse features added as
“LSA+”. LSA+ with our two methods now performs
only 11% and 15% below human-human agreement
(see Table 1).

We realize that training our system on tags where
humans had only moderate agreement is not ideal. Our
failure analyses indicated that the distinctions our algorithm
has difficulty making are the same distinctions
that the humans found difficult to make, so we believe
that improved agreement among human annotators
would result in similar improvements for our algorithm.

The results suggest that we can automatically annotate
team transcripts with tags. While the approach is
not quite as accurate as human taggers, LSA is able to
tag an hour of transcripts in under a minute. As a comparison,
it can take half an hour or longer for a trained
tagger to do the same task.

Measuring Agreement. The C-value measures the proportion
of inter-coder agreement, but does not take into
account agreement by chance. In order to adjust for
chance agreement we computed Cohen’s Kappa (Cohen,
1960), as shown in Table 1.


CODERS-AGREEMENT    C-VALUE    KAPPA
Human-Human         0.70       0.62
LSA-Human           0.59       0.48
LSA+-Human          0.63       0.53

Table 1. Kappa and C-Values.
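
The chance correction behind the Kappa column can be sketched as follows. This is the standard two-coder Cohen's Kappa for one label per item; how turns with several tags were handled is not described here, so treating each turn as carrying a single label is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two coders labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance from each
    coder's label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[tag] * counts_b[tag] for tag in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels invented for illustration.
kappa = cohens_kappa(["F", "F", "A", "A"], ["F", "A", "A", "A"])
```

In the toy example the coders agree on 3 of 4 items (0.75), but half of that agreement is expected by chance, so Kappa is lower than the raw proportion.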

5  Predicting  Overall  Team  Performance 

Throughout the CERTT Lab UAV-STE missions a performance
measure was calculated to determine each
team’s effectiveness at completing the mission. The
performance score was a composite of objective measures
including: amount of fuel/film used, number/type
of photographic errors, time spent in warning and alarm
states, and un-visited waypoints. This composite score
ranged from 0 to 1000. The score is highly predictive of
how well a team succeeded in accomplishing their mission.
We used two approaches to predict these overall
team performance scores: correlating the tag frequencies
with the scores, and correlating entire mission transcripts
with one another.

Team  Performance  Based  on  Tags.  We  computed 
correlations  between  the  team  performance  score  and 
tag  frequencies  in  each  team-at-mission  transcript. 

The tags for all 20545 utterances were first generated
using the LSA+ method. The tag frequencies for
each team-at-mission transcript were then computed by
counting the number of times each individual tag appeared
in the transcript and dividing by the total number
of individual tags occurring in the transcript.
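
The frequency computation just described can be sketched as below; the function name `tag_frequencies` and the toy transcript are hypothetical.

```python
from collections import Counter


def tag_frequencies(turn_tags):
    """Relative frequency of each tag in one transcript.

    turn_tags: one collection of predicted tags per turn. Each tag
    count is divided by the total number of tags in the transcript,
    not by the number of turns.
    """
    counts = Counter(tag for tags in turn_tags for tag in tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()} if total else {}


# Toy transcript of three turns, invented for illustration.
freqs = tag_frequencies([{"F", "A"}, {"F"}, {"U"}])
```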

Our preliminary results indicate that the frequency of
certain types of utterances correlates with team performance.
The correlations for tags predicted by computer
are shown in Table 2.


TAG                 PEARSON CORRELATION    SIG. (2-TAILED)
Acknowledgement     0.335                  0.006
Fact                0.320                  0.008
Response            -0.321                 0.008
Uncertainty         -0.460                 0.000

Table 2. Tag to Performance Correlations.
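
For reference, the Pearson coefficient reported in Table 2 can be computed as in the sketch below, where one list holds a tag's frequency per transcript and the other the corresponding performance scores. The function name and toy numbers are hypothetical, and the significance test is not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy data invented for illustration: uncertainty frequency vs. score.
uncertainty_freq = [0.05, 0.10, 0.15, 0.20]
scores = [900.0, 700.0, 650.0, 400.0]
r = pearson_r(uncertainty_freq, scores)
```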


Table 2 shows that the automated tagging provides
useful results that can be interpreted in terms of team
processes. Teams that tend to state more facts and acknowledge
other team members more tend to perform
better. Those that express more uncertainty and need to
make more responses to each other tend to perform
worse. These results are consistent with those found in
Bowers et al. (1998), but were generated automatically
rather than by the hand-coding done by Bowers.

Team Performance Based on Whole Transcripts.
Another approach to measuring content in team discourse
is to analyze the transcript as a whole. Using a
method similar to that used to score essays with LSA
(Landauer et al., 1998), we used the transcripts to predict
the team performance score. We generated the predicted
team performance scores as follows: given a subset
of transcripts, S, with known performance scores,
and a transcript, t, with unknown performance score, we
can estimate the performance score for t by computing
its similarity to each transcript in S. The similarity between
any two transcripts is measured by the cosine
between the transcript vectors in the UAV-Corpus semantic
space. To compute the estimated score for t, we
take the average of the performance scores of the 10
closest transcripts in S, weighted by cosines. A holdout
procedure was used in which the score for a team’s transcript
was predicted based on the transcripts and scores
of all other teams (i.e. a team’s score was only predicted
by the similarity to other teams). Our results indicated
that the LSA estimated performance scores correlated
strongly with the actual team performance scores (r =
0.76, p < 0.01), as shown in Figure 1. Thus, the results
indicate that we can accurately predict the overall performance
of the team (i.e. how well they fly and complete
their mission) just based on an analysis of their
transcript from the mission.
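
The estimation and holdout procedure described above can be sketched as follows. This is a sketch under stated assumptions rather than the code used in the study: the function names and the dictionary keys `team`, `vector`, and `score` are hypothetical, transcript vectors are plain lists of floats, and the fallback to an unweighted mean when the cosine weights are not positive is an added safeguard.

```python
import math


def cosine(u, v):
    """Cosine between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_score(target_vec, known, k=10):
    """Cosine-weighted average score of the k closest known transcripts.

    known: non-empty list of (vector, score) pairs.
    """
    nearest = sorted(
        ((cosine(target_vec, vec), score) for vec, score in known),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    weight = sum(cos for cos, _ in nearest)
    if weight <= 0:
        # Assumed fallback: unweighted mean when weights are unusable.
        return sum(score for _, score in nearest) / len(nearest)
    return sum(cos * score for cos, score in nearest) / weight


def holdout_predictions(transcripts, k=10):
    """Predict each transcript's score from other teams' transcripts only."""
    predictions = []
    for t in transcripts:
        others = [(o["vector"], o["score"]) for o in transcripts
                  if o["team"] != t["team"]]
        predictions.append(predict_score(t["vector"], others, k))
    return predictions


# Toy 2-dimensional data invented for illustration.
toy = [
    {"team": 1, "vector": [1.0, 0.0], "score": 800.0},
    {"team": 1, "vector": [0.9, 0.1], "score": 780.0},
    {"team": 2, "vector": [1.0, 0.0], "score": 700.0},
    {"team": 3, "vector": [0.0, 1.0], "score": 200.0},
]
predicted = holdout_predictions(toy, k=2)
```

Excluding every transcript from the same team, rather than only the target transcript, keeps a team's own speaking style from inflating the prediction.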


[Figure omitted; axis label: Actual Team Performance score.]

Figure 1. Correlation: Predicted and Actual Team
Performance.

6  Conclusions  and  Future  Work 

Overall, the results of the study show that LSA can be
used for tagging content as well as predicting team performance
based on team dialogues. Given the limitations
of the manual annotations, the results from the
tagging portion of the study are still comparable to other
efforts of automatic discourse tagging using different
methods and different corpora (Stolcke et al., 2000),
which found performance within 15% of the performance
of human taggers. We plan to conduct a more rigorous
manual annotation study. We expect that
improved human inter-coder reliability would eliminate
the need for “corrected tags” and allow for sequential
analysis of tags within turns. It is also anticipated that
incorporating additional methods that account for syntax
and discourse turns should further improve the overall
performance (see also Serafin et al., 2003).

Even with the limitations of the discourse tagging,
our results demonstrate that the LSA-based approach can
be applied as a method for automated measurement of team
performance. Using automatic methods we were able to
duplicate some of the results of Bowers and colleagues
(1998), who analyzed the sequence of content categories
occurring in communication in a flight simulator task.
They found that high team effectiveness was associated
with consistent responding to uncertainty, planning, and
fact statements with acknowledgments and responses.

The LSA-predicted team performance scores correlated
strongly with the actual team performance measures.
This demonstrates that analyses of discourse can
automatically measure how well a team is performing
on a mission. This has implications both for automatically
determining what discourse characterizes good and
poor teams as well as developing systems for monitoring
team performance in near real-time. We are currently
exploring two promising avenues to predict
performance in real time: integration of speech recognition
technology, and inter-turn tag sequences.

Research into team discourse is a new but growing
area. However, until recently, the large amounts of
transcript data have limited researchers from performing
analyses of team discourse. The results of this study
show that applying NLP techniques to team discourse
can provide accurate predictions of performance. These
automated tools can help inform theories of team performance
and also aid in the development of more effective
automated team training systems.